---
title: "Muris (2019), Section D.2: Panel data binary choice with incomplete data"
author: "Chris Muris"
date: "May 9, 2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction and setup

The purpose of this R Markdown file is to reproduce the findings in Appendix D.2 of the manuscript. That subsection follows Example 3 (Section 3 of the manuscript). It is a binary choice model with fixed effects for cross-sectional units that are observed three times (i.e. $T=3$). 

If all cross-section units provided a full set of 3 measurements, we could use `clogit` to estimate the regression coefficient in this model.

We are interested in the case of incomplete data. An observation is incomplete if data is not available for one of three time periods. I am not aware of existing work on dealing with incomplete data in this setting. The approach here is to:

1. Turn the 3 period setup (periods 1,2,3) into 3 sub-setups, each with two periods ((1,2), (1,3), (2,3))
2. For each sub-setup, use `clogit` is data is available for those two time periods
3. Glue it together via stacking (see below; possibly with re-weighting to deal with selection).

Cross-sectional units with at least two measurements contribute to at least one sub-setup. If only one measurement is available, the observation is not informative - this is a fixed effects model.

Start by load some packages that we use below, and set the number of simulations.

```{r}
library(tidyverse) # for data manipulation
library(survival) # for clogit
library(cowplot) # generate the plots in the paper's style
smallS <- 1000 # number of simulations for test runs
bigS <- 10000 # number of simulations for results in manuscripts
```

## Introduction to `clogit`

Start by testing `clogit` itself, which comes from the survival package. The first code chunk generate data from the fixed effects logit model.

```{r}
dgp_binary <- function(n = 1000, T = 3, beta = 1){
  
  # Generate some Xs and us
  df_s <- tibble(
    i = rep(1:n, each = T),
    t = rep(1:T, times = n),
    X = rnorm(n*T),
    u = rlogis(n*T),
    Z = u>0
  ) %>% 
    group_by(i) %>% 
    mutate(
      alphai = mean(X)
    ) %>% 
    ungroup() %>% 
    mutate(
      ystar = X*beta + alphai + u,
      y = ifelse(ystar>=0,1,0)
    ) %>% 
    select(i,t,y,X,Z)
  
  return(df_s)
}

(df_try <- dgp_binary())
```

We can apply `clogit` directly to this.

```{r}
clogit(y ~ X + strata(i), data = df_try)
```

Turn this into a small simulation study.

```{r}
one_sim <- function(s){
  df_s <- dgp_binary()
  b_1 <- glm (y~X, data = df_s, family = binomial)$coefficients["X"]
  b_2 <- clogit(y~X + factor(t) + strata(i), data = df_s)$coefficients["X"]
  return(tibble(s,b_1,b_2))
}
sim_out_1 <- 1:smallS %>% map_df(one_sim)

# Visualize
sim_out_1 %>% 
  gather(b_1,b_2,key = "Estimator", value = "b") %>% 
  ggplot() + geom_boxplot(aes(x=Estimator,y=b))
```

We are interested in applying the `clogit` estimator to a sub-set of the data. This will be useful when we face incomplete data. Our approach will be to apply the estimator to a subsample that has complete data for certain time periods, and using only those time periods. The code chunk below is a prototype.

```{r}
df_try_2 <- df_try %>% filter(t<3)
clogit(y ~ X + strata(i), data = df_try_2)
```

Alternatively, given that this is now a two-period problem, we could transform the data to differences and pass it to `glm`. This yields the same answer as the call to `clogit` above.

```{r}
df_try_2_MD <- df_try %>%
  group_by(i) %>% 
  summarize(
    D12 = ifelse((y[1]+y[2])==1,1,0),
    D2 = y[2],
    DX12 = X[2]-X[1]
  ) %>% 
  filter(D12 == 1)

glm (D2 ~ -1 + DX12, data = df_try_2_MD, family = binomial)
```

The approach to dealing with incomplete data will be to combine the 3 estimators that can be obtained this way - one for periods (1,2); one for periods (1,3); and one for periods (2,3). We will follow the approach in the last chunk: difference and pass to `glm`. If we stack the differenced data for the three estimators, we can pass them all to `glm` for a composite likelihood estimator that uses the information from all three sub-periods.

The following code chunks implements this. First, generate the differenced data for (1,2), (1,3), and (2,3); and stack the results into a data frame.

```{r}
tibble_st <- function(ss,tt,df){
  
  df_out <- df %>%
  group_by(i) %>% 
  summarize(
    D = ifelse((y[ss]+y[tt])==1,1,0),
    D2 = y[tt],
    DX = X[tt]-X[ss]
  ) %>% 
  filter(D == 1)
  
  return(df_out)
}

(df_two <- list(c(1,1,2),c(2,3,3)) %>% pmap_df(tibble_st,df = df_try))
```

Pass the resulting data frame into `glm`.

```{r}
glm (D2 ~ -1 + DX, data = df_two, family = binomial)
```

We can compare the performance of this composite likelihood procedure to the performance of the efficient conditional maximum likelihood procedure based on `clogit` at the 3-period data. Define a function that implements the split-transform-stack-glm estimator discussed above.

```{r}
febc_split <- function(df){
  df_two <- list(c(1,1,2),c(2,3,3)) %>% pmap_df(tibble_st,df = df)
  return(glm (D2 ~ -1 + DX, data = df_two, family = binomial)$coefficients)
}
```

Try both estimators, and add the results from a third one that uses only the first two periods.

```{r}
df_s <- dgp_binary()
clogit(y ~ X + strata(i), data = df_s)$coefficients # three-period
febc_split(df_s) # three x two-period
clogit(y ~ X + strata(i), data = df_s %>% filter(t<3))$coefficients # first two
```

The following function defines one run in a simulation study.

```{r}
one_sim <- function(s){
  
  df_s <- dgp_binary()
  b1 <- clogit(y ~ X + strata(i), data = df_s)$coefficients
  b2 <- febc_split(df_s)
  b3 <- clogit(y ~ X + strata(i), data = df_s %>% filter(t<3))$coefficients

  out <- c(s,b1,b2,b3)
  names(out) <- c("s","three","three-by-two","two")
  return(as_tibble(t(out)))
  
}
one_sim(1)
```

Run a small simulation study.

```{r}
result_2 <- 1:smallS %>% map_df(one_sim)
result_2 %>% gather(three:two,key = "Parameter", value = "Estimate") %>% 
  ggplot() + geom_boxplot(aes(x=Parameter,y=Estimate))

result_2 %>% gather(three:two,key = "Parameter", value = "Estimate") %>% 
  group_by(Parameter) %>% 
  summarize(
    bias = mean(Estimate)-1,
    rmse = mean((Estimate-1)^2)
  )
```

Conclusion: the composite-likelihood/split-transform-stack-glm procedure is almost as good as the conditional maximum likelihood estimator. We will proceed with the former, as it allows us to incorporate incomplete data.

## Add missings

First, add missingness, with a certain probability `p`.

```{r}
add_MCAR <- function(df,p){df %>% mutate(y = ifelse(runif(nrow(df))<p,NA,y))}
```

Try it.

```{r}
(df_s_missings <- add_MCAR(df_s,0.5))
```

Now try running the two estimators on this. 

```{r}
one_sim_missings <- function(s){
  
  df_s <- dgp_binary()
  df_s_md <- add_MCAR(df_s,0.2)
  b1 <- clogit(y ~ X + strata(i), data = df_s)$coefficients
  b2 <- febc_split(df_s)
  b3 <- clogit(y ~ X + strata(i), data = df_s %>% filter(t<3))$coefficients
  
  b4 <- clogit(y ~ X + strata(i), data = df_s_md)$coefficients
  b5 <- febc_split(df_s_md)
  b6 <- clogit(y ~ X + strata(i), data = df_s_md %>% filter(t<3))$coefficients
  
  out <- c(s,b1,b2,b3,b4,b5,b6)
  names(out) <- c("s","three","three-by-two","two","threeMD","three-by-twoMD","twoMD")
  return(as_tibble(t(out)))
  
}
one_sim_missings(1)
```

A small simulation study, and visualize its results.

```{r}
result_3 <- 1:smallS %>% map_df(one_sim_missings)
result_3 %>% gather(three:twoMD,key = "Parameter", value = "Estimate") %>% 
  ggplot() + geom_boxplot(aes(x=Parameter,y=Estimate))
```

Display the results in a table.

```{r}
result_3 %>% gather(three:twoMD,key = "Parameter", value = "Estimate") %>% 
  group_by(Parameter) %>% 
  summarize(
    bias = mean(Estimate)-1,
    rmse = mean((Estimate-1)^2)
  )
```

Now add the "use only complete strata" estimator...

```{r}
balanced_subpanel <- function(df){
  
  df_out <- df %>% mutate(
    missingness = ifelse(is.na(y),0,1)
  ) %>% 
    group_by(i) %>% 
    mutate(
      complete = min(missingness)
    ) %>% 
    ungroup() %>% 
    filter(complete == 1)
 
  return(df_out )
  
}
balanced_subpanel(add_MCAR(df_s,0.2))
```

And add an estimator based on the complete subpanel to the comparison.

```{r}
one_sim_missings_2 <- function(s){
  
  df_s <- dgp_binary()
  df_s_md <- add_MCAR(df_s,0.2)
  b1 <- clogit(y ~ X + strata(i), data = df_s)$coefficients
  b2 <- febc_split(df_s)
  b3 <- clogit(y ~ X + strata(i), data = df_s %>% filter(t<3))$coefficients
  
  b4 <- clogit(y ~ X + strata(i), data = df_s_md)$coefficients
  b5 <- febc_split(df_s_md)
  b6 <- clogit(y ~ X + strata(i), data = df_s_md %>% filter(t<3))$coefficients
  
  df_s_complete <- balanced_subpanel(df_s_md)
  b7 <- clogit(y ~ X + strata(i), data = df_s_complete)$coefficients
  

  out <- c(s,b1,b2,b3,b4,b5,b6,b7)
  names(out) <- c("s","three","three-by-two","two","threeMD","three-by-twoMD","twoMD","threeSUB")
  return(as_tibble(t(out)))
  
}
one_sim_missings_2(1)
```

A small simulation study, and visualization.

```{r}
result_4 <- 1:smallS %>% map_df(one_sim_missings_2)
result_4 %>% gather(three:threeSUB,key = "Parameter", value = "Estimate") %>% 
  ggplot() + geom_boxplot(aes(x=Parameter,y=Estimate))
```

Numerical results.

```{r}
result_4 %>% gather(three:threeSUB,key = "Parameter", value = "Estimate") %>% 
  group_by(Parameter) %>% 
  summarize(
    bias = mean(Estimate)-1,
    rmse = mean((Estimate-1)^2)
  )
```

## Simulation study (no selection)

We can turn this into a small simulation study about how the relative performances of the estimators depends on `p`.

```{r}
one_sim_p <- function(p,s){
  
  df_s <- dgp_binary()
  df_s_md <- add_MCAR(df_s,p)
  # b1 <- clogit(y ~ X + strata(i), data = df_s)$coefficients #oracle
  b2 <- febc_split(df_s) # oracle 3x2
  # b3 <- clogit(y ~ X + strata(i), data = df_s %>% filter(t<3))$coefficients
  
  # b4 <- clogit(y ~ X + strata(i), data = df_s_md)$coefficients
  b5 <- febc_split(df_s_md) # my paper
  # b6 <- clogit(y ~ X + strata(i), data = df_s_md %>% filter(t<3))$coefficients
  
  df_s_complete <- balanced_subpanel(df_s_md)
  b7 <- clogit(y ~ X + strata(i), data = df_s_complete)$coefficients # complete subpanel
  
  out <- c(s,p,b2,b5,b7)
  names(out) <- c("s","p","oracle","this paper","balanced")
  return(as_tibble(t(out)))
  
}
one_sim_p(0.5,1)
```

Now look at a bunch of `p`'s together.

```{r}
one_sim_ps <- function(s,ps) {ps %>% map_df(one_sim_p,s = s)}
one_sim_ps(2,c(0.2,0.7))
```

Now run a whole bunch of simulations.

```{r}
result_5 <- 1:bigS %>% map_df(one_sim_ps,ps = seq(from = 0.1, by = 0.1, to = 0.5))
```

Now look at in in a picture.

```{r}
result_plot <- result_5 %>% gather(oracle:balanced,key = "Estimator", value = "Estimate") %>% 
  mutate(
    Estimator = fct_recode(Estimator,
             "infeasible" = "oracle",
             "this paper" = "this paper",
             "complete case" = "balanced")
    )%>% 
  group_by(Estimator,p) %>% 
  summarize(
    bias = abs(mean(Estimate)-1),
    sd = sd(Estimate),
    rmse = mean((Estimate-1)^2)
  )

plot.bias <- result_plot %>% ggplot() + 
  geom_line(aes(x = p, y = bias, linetype = Estimator)) + 
  theme(legend.position="none") +
  background_grid(major = "xy", minor = "none")

plot.bias 
```

Now generate the plot with standard deviation, and place the plots next to each other.

```{r}
plot.sd <- result_plot %>% ggplot() + 
  geom_line(aes(x = p, y = sd, linetype = Estimator)) + 
  theme(legend.justification=c(0,0), legend.position=c(0.1,0.6)) +
  background_grid(major = "xy", minor = "none")

(fig1 <- plot_grid(plot.sd, plot.bias, labels = "AUTO"))
```

Save the result for the paper.

```{r}
save_plot("binarychoice_noselection.pdf", fig1,
          ncol = 2, # we're saving a grid plot of 2 columns
          nrow = 1, # and 2 rows
          base_aspect_ratio = 1.1 # each individual subplot should have an aspect ratio of 1.1
          )
```

## Estimator under MAR (infeasible)

The following code chunks add `MAR` to the above program. Additional commentary can be found in the manuscript, section D.2.

```{r}
tibble_st_MAR <- function(ss,tt,df){
  
  df_out <- df %>%
  group_by(i) %>% 
  summarize(
    D = ifelse((y[ss]+y[tt])==1,1,0),
    D2 = y[tt],
    DX = X[tt]-X[ss],
    Z1 = Z[ss],
    Z2 = Z[tt]
  ) %>% 
  filter(D == 1)
  
  return(df_out)
}

febc_split_MAR <- function(df,p1,p0){
  
  df_two <- list(c(1,1,2),c(2,3,3)) %>% 
    pmap_df(tibble_st_MAR,df = df) %>% 
    mutate(p = (Z1*p1 + (1-Z1)*p0)*(Z2*p1 + (1-Z2)*p0))
  
  return(glm (D2 ~ -1 + DX, data = df_two, family = binomial, weights = 1/p)$coefficients)

}

df_s <- add_missings_destruction(dgp_binary(),1,0.5)
febc_split_MAR(df_s,1,0.5)
```

Drop this into a simulation.

```{r}
one_sim_reconstruction <- function(s,p1,p0){
  
  df_s <- dgp_binary()
  df_s_md <- add_missings_destruction(df_s,p1,p0)
  # b1 <- clogit(y ~ X + strata(i), data = df_s)$coefficients #oracle
  b2 <- febc_split(df_s) # oracle 3x2
  # b3 <- clogit(y ~ X + strata(i), data = df_s %>% filter(t<3))$coefficients
  
  # b4 <- clogit(y ~ X + strata(i), data = df_s_md)$coefficients
  b5 <- febc_split(df_s_md) # my paper
  # b6 <- clogit(y ~ X + strata(i), data = df_s_md %>% filter(t<3))$coefficients
  
  df_s_complete <- balanced_subpanel(df_s_md)
  b7 <- clogit(y ~ X + strata(i), data = df_s_complete)$coefficients # complete subpanel
  
  b8 <- febc_split_MAR(df_s_md,p1,p0) # using IPW
  
  out <- c(s,b2,b5,b7,b8)
  names(out) <- c("s","oracle","meMCAR","balanced MCAR","meMAR")
  return(as_tibble(t(out)))
  
}
one_sim_reconstruction(1,1,0.5)
```

Run a simulation study.

```{r}
result_7 <- 1:smallS %>% map_df(one_sim_reconstruction,0.9,0.1)
result_plot <- result_7 %>% gather(oracle:meMAR,key = "Parameter", value = "Estimate") %>% 
  group_by(Parameter) %>% 
  summarize(
    bias = abs(mean(Estimate)-1),
    sd = sd(Estimate),
    rmse = mean((Estimate-1)^2)
  ) %>% ungroup()
  
result_7 %>% gather(oracle:meMAR,key = "Parameter", value = "Estimate") %>% ggplot() + geom_boxplot(aes(x=Parameter,y=Estimate))
```

Display the behavior as a function of `p0`, fixing `p1` at 1.

```{r}
final_sim <- function(s){
  is1 <- function(p0,s){one_sim_reconstruction(s,1,p0)}
  
  p0s <- c(0.1,0.2,0.3,0.4,0.5)
  is2 <- p0s %>% map_df(is1,s = s)
  
  return(is2 %>% mutate(p0 = p0s ))
}
final_sim(1)
```

Run a simulation study on this and then visualize it.

```{r}
result_8 <- 1:smallS %>% map_df(final_sim)
result_plot <- result_8 %>% gather(oracle:meMAR,key = "Parameter", value = "Estimate") %>% 
  group_by(Parameter,p0) %>% 
  summarize(
    bias = abs(mean(Estimate)-1),
    sd = sd(Estimate),
    rmse = mean((Estimate-1)^2)
  ) %>% ungroup()
  
result_plot %>% gather(bias:rmse, key = "Measure", value = "Performance") %>%  
  ggplot() + geom_line(aes(x = p0, y = Performance, colour = Parameter)) + facet_wrap(~Measure)
```

## Estimation under MAR (feasible)

Last thing to do is to run the feasible version, where the propensity score is estimated.

```{r}
tibble_st_MAR_feasible <- function(ss,tt,df){
  df_out <- df %>%
  group_by(i) %>% 
  summarize(
    D = ifelse((y[ss]+y[tt])==1,1,0),
    D2 = y[tt],
    DX = X[tt]-X[ss],
    p1hat = p_hat[ss],
    p2hat = p_hat[tt]
  ) %>% 
  filter(D == 1)
  return(df_out)
}

febc_split_MAR_feasible <- function(df){
  a <- glm(D~Z, data = df, family = "binomial")
  df$p_hat <- predict(a, type = "response")
  
  df_two <- list(c(1,1,2),c(2,3,3)) %>% 
    pmap_df(tibble_st_MAR_feasible,df = df) %>% 
    mutate(p = p1hat*p2hat)
  
  return(glm (D2 ~ -1 + DX, data = df_two, family = binomial, weights = 1/p)$coefficients)
}
```

Set up the sim.

```{r}
one_sim_refeas <- function(s,p1,p0){
  
  df_s <- dgp_binary()
  df_s_md <- add_missings_destruction(df_s,p1,p0)
  b2 <- febc_split(df_s) 
  
  b5 <- febc_split(df_s_md) # my paper
  
  df_s_complete <- balanced_subpanel(df_s_md)
  b7 <- clogit(y ~ X + strata(i), data = df_s_complete)$coefficients # complete subpanel
  
  b8 <- febc_split_MAR(df_s_md,p1,p0) # using IPW
  b9 <- febc_split_MAR_feasible(df_s_md) # using feasible IPW
  
  out <- c(s,b2,b5,b7,b8,b9)
  names(out) <- c("s","oracle","meMCAR","balanced MCAR","meMAR","meMARfeas")
  return(as_tibble(t(out)))
  
}
one_sim_refeas(1,1,0.5)
```

Wrap it.

```{r}
feasible_sim <- function(s){
  is1 <- function(p0,s){one_sim_refeas(s,1,p0)}
  
  p0s <- c(0.1,0.2,0.3,0.4,0.5)
  is2 <- p0s %>% map_df(is1,s = s)
  
  return(is2 %>% mutate(p0 = p0s ))
}
feasible_sim(1)
```

Run the simulation study.

```{r}
result_9 <- 1:bigS %>% map_df(feasible_sim)
result_plot <- result_9 %>% gather(oracle:meMARfeas,key = "Parameter", value = "Estimate") %>% 
  group_by(Parameter,p0) %>% 
  summarize(
    bias = abs(mean(Estimate)-1),
    sd = sd(Estimate),
    rmse = mean((Estimate-1)^2)
  ) %>% ungroup()
  
result_plot %>% gather(bias:rmse, key = "Measure", value = "Performance") %>%  
  ggplot() + geom_line(aes(x = p0, y = Performance, colour = Parameter)) + facet_wrap(~Measure)
```

Generate a publication-quality plot.

```{r}
result_plot <- result_9 %>% 
  select(-meMAR) %>% 
  gather(oracle:meMARfeas,key = "Estimator", value = "Estimate") %>% 
  mutate(
    Estimator = fct_recode(Estimator,
                           "infeasible" = "oracle",
                           "this paper" = "meMARfeas",
                           "complete case" = "balanced MCAR",
                           "this paper, MCAR" = "meMCAR") 
  )%>% 
  group_by(Estimator,p0) %>% 
  summarize(
    bias = abs(mean(Estimate)-1),
    sd = sd(Estimate),
    rmse = mean((Estimate-1)^2)
  )

plot.bias <- result_plot %>% ggplot() + 
  geom_line(aes(x = p0, y = bias, linetype = Estimator)) + 
  theme(legend.position="none") +
  background_grid(major = "xy", minor = "none")

plot.bias
```

Get the other plot.

```{r}
plot.sd <- result_plot %>% ggplot() + 
  geom_line(aes(x = p0, y = sd, linetype = Estimator)) + 
  theme(legend.justification=c(0,0), legend.position=c(0.45,0.6)) +
  background_grid(major = "xy", minor = "none")

plot.sd
```

Wrap and export.

```{r}
(fig2 <- plot_grid(plot.sd, plot.bias, labels = "AUTO"))
save_plot("binarychoice_selection.pdf", fig2,
          ncol = 2, # we're saving a grid plot of 2 columns
          nrow = 1, # and 2 rows
          base_aspect_ratio = 1.1 # each individual subplot should have an aspect ratio of 1.1
          )
```


## Saving the results

```{r}
save.image("20180515_results_binaryFE.RData")
```

